Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance
Abstract
Developing high-performance GPU code is labor-intensive. Ideally, developers could recoup high GPU development costs by generating high-performance programs for CPUs and other architectures from the same source code. However, current OpenCL compilers for non-GPU targets do not fully exploit the optimizations in well-tuned GPU code. To address this problem, we develop an OpenCL implementation that efficiently exploits GPU optimizations on multicore CPUs. Our implementation translates SIMT parallelism into SIMD vectorization and SIMT coalescing into cache-efficient access patterns. These translations are especially challenging when control divergence is present. Our system addresses divergence through a multi-tier vectorization approach based on dynamic convergence checking. The proposed approach outperforms existing industry implementations, achieving geometric mean speedups of 2.26× and 1.09× over AMD’s and Intel’s OpenCL implementations, respectively.
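The idea of dynamic convergence checking can be illustrated with a small simulation. The sketch below is not the authors' implementation: the function names, the chunk width of 4, and the simple two-tier split (a uniform "vector" path versus a per-lane "scalar" fallback) are all assumptions made for illustration. It only shows the control logic: work-items are grouped into SIMD-width chunks, the branch predicate is evaluated for every lane, and the uniform path is taken only when all lanes agree.

```python
# Hypothetical illustration only: the names, the chunk width, and the
# two-tier split are assumptions, not the paper's implementation.
WIDTH = 4  # assumed SIMD width (lanes per chunk)


def kernel_scalar(x):
    """Per-work-item kernel body with a data-dependent branch."""
    if x >= 0:
        return x * 2
    return -x


def run_chunk(chunk):
    """Run one chunk of work-items; return (results, tier_used)."""
    preds = [x >= 0 for x in chunk]
    if all(preds):
        # Convergent: every lane takes the branch, so a single uniform
        # code path suffices (a compiler could emit straight-line SIMD).
        return [x * 2 for x in chunk], "vector"
    if not any(preds):
        # Convergent: no lane takes the branch.
        return [-x for x in chunk], "vector"
    # Divergent: lanes disagree, so fall back to per-lane execution.
    return [kernel_scalar(x) for x in chunk], "scalar"


def run_work_group(items, width=WIDTH):
    """Process all work-items chunk by chunk, recording the tier used."""
    results, tiers = [], []
    for start in range(0, len(items), width):
        out, tier = run_chunk(items[start:start + width])
        results.extend(out)
        tiers.append(tier)
    return results, tiers
```

Calling `run_work_group` on an input whose first two chunks are uniformly non-negative and uniformly negative, and whose third chunk mixes signs, would report the "vector" tier for the first two chunks and the "scalar" tier for the third. A real implementation might add intermediate tiers, such as masked vector execution, between these two extremes.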
Similar resources
Importance of Explicit Vectorization for CPU and GPU Software Performance
Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU imp...
Accelerating High-Dimensional Nearest Neighbors for Video Search
The k-nearest neighbor algorithm (kNN) is a critical algorithm used extensively in fields such as Computer Vision, Robotics, and Machine Learning. In this work, we address the performance of FLANN, a popular kNN library, at the node-level by co-designing indexing and search algorithms with software support. We characterize, profile, and optimize FLANN for high-dimensionality (e.g., ≥ 4096) for ...
Cross-Platform OpenCL Code and Performance Portability for CPU and GPU Architectures Investigated with a Climate and Weather Physics Model
Current multi- and many-core computing typically involves multi-core Central Processing Units (CPU) and many-core Graphical Processing Units (GPU) whose architectures are distinctly different. To preserve the longevity of application codes, it is highly desirable to have a programming paradigm that supports these current and future architectures. Open Computing Language (OpenCL) is created to address this p...
Tuning Principal Component Analysis for GRASS GIS on Multi-core and GPU Architectures
This paper presents optimizations to Principal Component Analysis (PCA) in GRASS GIS. The current implementation of PCA in GRASS is based on eigenvalue decomposition, which does not have high memory requirements but can suffer from low runtime performance. In modern computers, significant performance improvements can be achieved by appropriately taking advantage of the memory configuration (...
Weld: Fast Data-Parallel Computation on Modern Hardware
Modern hardware is difficult to use efficiently, requiring complex optimizations like vectorization, loop blocking and load balancing to get good performance. As a result, many widely used data processing systems fall well short of peak hardware performance. We have developed Weld, an intermediate language and runtime that can run data-parallel computations efficiently on modern hardware. The c...
Publication date: 2015